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Abstract 

We consider the problem of discriminative fac¬ 
tor analysis for data that are in general non- 
Gaussian. A Bayesian model based on the 
ranks of the data is proposed. We first intro¬ 
duce a new max-margin version of the rank- 
likelihood. A discriminative factor model is then 
developed, integrating the max-margin rank- 
likelihood and (linear) Bayesian support vec¬ 
tor machines, which are also built on the 
max-margin principle. The discriminative fac¬ 
tor model is further extended to the nonlin¬ 
ear case through mixtures of local linear classi¬ 
fiers, via Dirichlet processes. Fully local con- 
jugacy of the model yields efficient inference 
with both Markov Chain Monte Carlo and varia¬ 
tional Bayes approaches. Extensive experiments 
on benchmark and real data demonstrate superior 
performance of the proposed model and its po¬ 
tential for applications in computational biology. 

1. Introduction 

Modern applications in computational biology and bioin¬ 
formatics routinely involve data coming from different 
sources, measured and quantified in different ways, e.g., 
averaged intensities in gene expression, cytokines and pro- 
teomics, or fragment counts in RNA and microRNA se¬ 
quencing. However, they all share a common trait; data are 
rarely Gaussian, and they are often discrete, the latter due to 
digital technologies used for quantification. Nevertheless, 
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a large proportion of statistical analyses performed on these 
data assume Gaussianity in one way or another. This is be¬ 
cause customary preprocessing pipelines employ normal¬ 
ization and/or domain transformation approaches aimed at 
making the data as Gaussian, or at least as symmetric, as 
possible. For example, one popular yet simple strategy for 
RNA sequencing data is to rescale each sample to correct 
for technical variability, followed by log-transformation or 
quantile normalization (Dillies et al., 2013). This and many 
other examples have the same rationale: the data transfor¬ 
mations are order preserving, while also trying to achieve 
a desired distribution, typically Gaussian. Figure 1 shows 



Figure 1. Intuition behind data modeling with ranks. Top and 
right panels are group-wise empirical distributions for rank(x) 
and log(x), respectively. 


expression for a particular gene and two phenotypes (ac¬ 
tive and latent tuberculosis), from the dataset described in 
Section 4.2. The horizontal and vertical axes show respec¬ 
tively log(x) (log-transformed) and rank(x) (ranked) gene 
intensities. We see that from either axis we could derive 
a decision rule to separate the two groups, so that we can 
predict the label of new data, without worrying too much 
about the actual values or scaling of the axes. In fact, from 
their group-wise empirical distributions, we see that log(x) 
and rank(x) seem to have nearly the same optimal deci¬ 
sion rule. Note that in general we are not required to log- 
transform the data, and in principle we could either use raw 
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data or any other monotone transformation of the gene in¬ 
tensities while still being able to build a classifier, if the 
data support it. Motivated by this fact, and also by the 
success of standard nonparametric statistical approaches 
based on ranks, such as Spearman’s rank correlation and 
Wilcoxon rank-sum test (Lehmann & D’Abrera, 2006), in 
this paper, we propose a new discriminative factor model 
based directly on the ranks of observed data as opposed to 
their values. This new model enjoys three significant bene¬ 
fits: 

(1) Since we do not model actual values, we could treat 
ordinal, continuous and discrete data within the same 
framework. 

(2) We can in principle make weaker assumptions about 
the distribution of the data. 

(3) We can jointly identify subsets of variables with similar 
(rank) correlation structures, some of which might be 
able to (partially) separate different classes and could 
be combined to build a classification model. 

These advantages come with the price of not being able 
to account for the actual values of the data, which is not 
such a big disadvantage, as long as we are only interested 
in parameters of the model involving relative differences or 
similarities between elements of a given dataset (which is 
typically the case when building classifiers). 

Modeling with ranks is not a new idea, in fact Pettitt (1982) 
presented a linear regression model using a likelihood func¬ 
tion based on the ranks of observed data, coined by the au¬ 
thors as rank-likelihood. More recently it was also used by 
Hoff (2007) to estimate the correlation matrix of data from 
disparate types, e.g., binary, discrete and continuous. Here 
we employ the rank-likelihood as a building block for a dis¬ 
criminative factor model, with the ultimate goal of being 
able to jointly perform feature extraction and classification 
while decreasing the effort required to preprocess raw data. 

The contributions of this paper are three-fold: 

(1) We introduce a new max-margin version of the rank- 
likelihood geared towards Bayesian factor modeling, 
and we present a novel data augmentation scheme that 
allows for fast inference due to local conjugacy. 

(2) We propose a discriminative factor model by integrat¬ 
ing max-margin rank-likelihood, (linear) Bayesian sup¬ 
port vector machines (SVMs) and global-local shrink¬ 
age priors. One key feature of our model is that likeli¬ 
hood functions for both data and labels have the max- 
margin property. 

(3) We extend the discriminative factor model to nonlin¬ 
ear decision functions, through a mixture of local lin¬ 
ear classifiers implemented via a Dirichlet process im¬ 
posed on the latent space of the factor model. 


Experiments on benchmark and real data, namely USPS, 
MNIST, gene expression and RNA sequencing, demon¬ 
strate that the proposed model often performs better than 
competing approaches. Results on the real data demon¬ 
strate the potential of our model for applications in com¬ 
putational biology, not only for well-established, high- 
throughput technologies such as gene expression and 
metabolomics, but also in emerging ones such as RNA se¬ 
quencing and proteomics. 

2. Max-margin rank-likelihood 

Ordinal probit model Consider N data samples, each 
a d-dimensional vector with ordinal values; the discussion 
of ordinal data helps to motivate and explain the model, 
which is subsequently generalized to real-valued data. The 
data are represented by the dxN matrix X, the nth column 
of which represents the nth data vector. Let Xi^n represent 
element (z, n) of X, corresponding to component i of the 
nth data vector, modeled as 

Xi,n = giiWi^n) , ( 1 ) 

Vi^ji , Vi^n ^ *^(0; 1) ; 

where A = [ai ... G is the factor loadings 

matrix with AT factors, the factor scores for all JV data sam¬ 
ples are represented by Z = [zi ... zjv] G Wi^n is 

element (z, n) of W G and gi(-) is a non-decreasing 

function (such that the rankings of the JV realizations of 
component z are preserved). Specifically, large values in 
(rows of X) correspond to large values in (rows of W). 

Assume that component z of each data vector takes val¬ 
ues in the set Ji}. Then function pi(-) can be 

fully specified by — 1 ordered parameters < ■ ■ ■ < 
/zi j._i, often called “cut points”, yielding 

X^,n = = J if < hij , (2) 

where hi^o = —oo, lii,Ji = oo and = [hi,i ■ ■ ■ hij.-i] 

is a vector of thresholds for the zth row of W. Equations ( 1 ) 
and (2) define a typical probit factor model for ordinal vec¬ 
tor data (Hoff, 2009). 

Inferring W for the model in (1) and (2) is not com¬ 
plicated, because its conditional posterior corresponds 
to a truncated Gaussian distributions, i.e., Wi^n 

where < Kj 
for j = Xi^n- Nevertheless, the model has three impor¬ 
tant shortcomings: (z) Specifying a prior distribution for 
{hi}iLi might be difficult because often such information 
is not available to the practitioner; (zz) if Ji is large, the 
number of parameters of the model that need to be esti¬ 
mated increases substantially, making prior specification 
and inference harder; (zzz) sampling from a truncated Gaus¬ 
sian distribution can be relatively expensive, and may be 
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prone to numerical instabilities, especially when samples 
lie near the tails of the distribution. 


Rank-likelihood model Provided that gi{wi^n) is non¬ 
decreasing by assumption, we know that given Xi^n < 
Xi^n', then gi{wi^n) < gi{wi^n') and Wi^n < thus 

^ ^ if ^ : (3) 

where i?(xi) is the set of all possible vectors such 
that rank(xi) = rank(wi). Given x^, since neither 
nor p{wi G i?(xi)) depend on pi(wi), we can formu¬ 
late inference for A and Z directly in terms of G 
R{Ki) through the rank-likelihood representation p(wi G 
i?(xi)|A, Z) (Pettitt, 1982). Specifically, we can write 
the joint probability distribution of the model above as 


nf=iP(Wze R(xi), ai,Z) = 


p(A)p(Z) TT I /" TT A/'(u;i,„;a7z„, I . (4) 


The integrals to the right hand side of (4) are in gen¬ 
eral difficult to compute. However, it is not necessary to 
do so, because we can estimate the posterior of parame¬ 
ters {A, Z, W} via Gibbs sampling, by iteratively cycling 
through their conditional posterior distributions. For A and 
Z we need to sample fromp(Z|W, A) andp(A|W, Z), re¬ 
spectively, where we instantiate W such that Wj G R{xi) 
for i = 1,... ,d. For we only need to be able to 
sample from p(wi G R(xi)jai, Z). We can write 

p([wi,n e R(xi),ai,Z), where contains all 

elements from Wi but Wi^n- From (1) and (3), Wi^n is Gaus¬ 
sian and restricted to the set R{xi), respectively. Condi¬ 
tioning on we have 


p(Wi,n|Wi\„,ai,Z„) 


a„ Zn) 

TJV{aJzn,l,wl„,wlJ , 


where w-„ = Taax{wi^n' ■ Xi^n' < Xi^n] and = 
minimi : Xi^n < which jointly guarantee that 

K.n Wi\„] e R{x{). Note that given 
is conditionally independent of Wi\{wl and also 

that the conditional posteriors for the ordered probit and 
rank-likelihood based models are identical except that for 
the former, constraints for Wi^n come from as op¬ 

posed to thresholds h^. In fact, the rank-likelihood model 
can be seen as an alternative to the ordered probit model 
in which the threshold variables have been marginalized 
out (Hoff, 2009). It is important to point out that in ap¬ 
plications when the connection between observed and la¬ 
tent variables, gi{wi), is of interest, the rank-likelihood is 
not applicable. Fortunately, in factor models we are usu¬ 
ally only interested in A and Z, the loadings and the factor 
scores, respectively (see Murray et al. (2013), for example). 


->-f £[ (loss function) 

7^- o Kn 

o Wi^n 


^ X 

^i,n ^i,u 

Figure 2. Graphical representation of the loss function associ¬ 
ated to the max-margin rank-likelihood in (6), where = 

4(w*,n-w,>)-l-fe(u’',„-Wi.n)- Note that -|-£ < < 

— e is not penalized by the loss function. 

Max-margin rank-likelihood One disadvantage of the 
rank-likelihood model is that differences between elements 
of Wi can be arbitrarily small, as there is no mechanism 
in the prior distribution of Wj to prevent this from happen¬ 
ing. In the ordered probit model we can do so via the prior 
for the thresholds h^, however this is not necessarily easily 
accomplished. Fortunately, for the rank-likelihood we can 
alleviate this issue by modifying (3) as 

f^mm(x^) {Wj G M . Wi ji G C If G , (5) 



where we have made explicit that any two distinct elements 
of Wi must be separated by a gap of size no smaller than 
e > 0. Furthermore, max{wi^„/ : xi^n' < xi^n} + e < 
Wi^n < min{u;i_„/ : Xi^n < — e- From (5) we can 

write a pseudo-likelihood for Wi „ as 


G(“'i,n|Wi\„) = exp - U)“„) - : (6) 


where Wi^n = sjZn, = a^z", = a^z^ and 

£e{u) = 2max(0,u -f e) can be interpreted as the “one¬ 
sided” e-sensitive loss. From (6) this means that £e{u) > 
0 only if Wi^n < -f e or Wi^n > it also 

means that this loss function does not penalize ru* „ -f e < 
Wi^n < — e and e is called the margin. See Figure 2 for 

a graphical representation of the proposed composite loss 
function. Maximizing (6) is equivalent to finding G 
i?mm(xi) such that differences between neighbor elements 
of Wi are maximal given e, hence the term max-margin is 
used. 

Recent work by Poison & Scott (2011b) has shown 
that £e{u) admits a location-scale mixture of Gaussians, 
specifically, they showed that exp{—2max(0, it)} = 
/ Af{u; —A, X)dX, thus we can rewrite (6) as 


Li{'Wi,n\''^i\n) J 'X’i.rn ^ ^i,n) 

X - Wi,n; -e - Xl„, Xl„)dXl„dXl„ , (7) 

where •) is the density function of a Gaussian dis¬ 
tribution, and we have introduced two sets of latent vari¬ 
ables {A“„} and {A* „}. This data augmentation scheme 
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implies that the pseudo-likelihood before marginalization, 
Kn’ K n)’ Conjugate to a Gaussian distri¬ 
bution, just as in the original rank-likelihood formulation, 
but without the difficulty of truncated Gaussians, because 
Wi^n is now exactly Zn, not a random variable. Note that 
as a result of this, we have transferred the uncertainty be¬ 
tween the rank of Xi^n and the factorization z„ from Wi^n 
in the rank-likelihood (and the ordered probit) to the set of 
location-scale parameters in our max-margin 

formulation. 

Discrete and continuous data So far we have assumed 
that we have ordinal data (the cut points of the ordinal 
model help explain the associated mechanics of the rank- 
likelihood model). However, we can also use the rank- 
likelihood with discrete or continuous data, as long as we 
acknowledge that likelihood and posteriors derived from 
them only contain information about the order of the ob¬ 
servations and not their actual values. 

In general terms, factor models are concerned with learn¬ 
ing about the covariance structure of observed data via a 
low rank matrix decomposition, AZ. In this sense, the role 
of the likelihood is to define the way in which covariances 
are measured. This means that one important difference be¬ 
tween standard Gaussian and rank-likelihood based factor 
models is that they use different notions of covariance; very 
much in the same spirit of the differences between Pearson 
and Spearman correlations. Another important difference 
is the generative mechanism implied by the likelihood. In 
the rank-likelihood, we can generate a statistic, namely the 
rank, based on a sample population but not their values. 
This happens because we ignore the part of the model that 
links the statistic with actual data, specifically 

p{xi\ai, Z, hi) = p(rank(xi), Xi|ai, Z, h^) 

= p(rank(xi)|ai, Z)p(xj|rank(xi), a^, Z, h^). 


We infer a^ and Z strictly from p(rank(xi)|ai, Z), the 
marginal likelihood, via G Since we ignore 

p(xi|rank(xi), ai, Z, hi), we do not infer the thresholds 
hi, thus in the strictest sense we are not using all informa¬ 
tion provided by Xi, i.e., its values, however we are assum¬ 
ing that ranks alone contain enough information to be able 
to characterize the covariance structure of the data so we 
can reliably estimate A and Z. Additional examples and 
further discussion of Bayesian analysis employing similar 
marginal likelihood strategies can be found in Monahan & 
Boos (1992). 


3. Bayesian SVM based discriminative factor 
model 

When the data being analyzed belong to two different 
classes, encoded as {—1,1}, labels y = [yi ... S 


{—1,1} will encourage our factor model to learn discrim¬ 
inative features (loadings and scores) from the data, then 
these features can be used to make predictions for new data. 
This modeling approach is commonly known as super¬ 
vised dictionary learning or discriminative factor analysis 
(Mairal et al., 2008). From a Bayesian perspective, factor 
models and probit/logit link based classifiers have been al¬ 
ready successfully combined; see for instance Salazar et al. 
(2012); Quadrianto et al. (2013). 

Unlike previous work, we continue with the max-margin 
theme and develop a supervised factor model using 
Bayesian support vector machines (SVMs). The same re¬ 
sult from Poison & Scott (201 lb) used above to derive the 
max-margin rank-likelihood provides a pseudo-likelihood 
for the hinge loss, traditionally employed in the context of 
SVMs (Poison & Scott, 201 lb). Specifically, 


L„{yn\l3,Zn) = exp{-2max(0,w„)} 



/ 1 jUn + XnY \ 
'v 2 A)) ; 


d\ 


C 

n f 


( 8 ) 


where = 1 — ynf3^Zn, [3 € is a vector of clas¬ 
sifier coefficients and {A))} is a vector of latent variables, 
with superscript c denoting the classifier. In Poison & Scott 
(2011b) covariates, z„, are observed while here they are 
latent variables (factor scores) that need to be estimated 
jointly with the remaining parameters of a factor model. It 
has been shown empirically that linear margin-based classi¬ 
fiers, SVMs being a special case, often perform better than 
those using logit or probit links (Poison & Scott, 2011a; 
Henao et al., 2014). 

Interestingly, in our factor model the max-margin mech¬ 
anism plays two roles, i.e., data and labels are both con¬ 
nected to the factor model core via max-margin pseudo¬ 
likelihoods: rank-likelihood for the data and hinge loss for 
the labels. Furthermore, for the loadings, since shrinkage 
for A is usually a requirement for interpretability or when 
N d, here we use a three-parameter-beta normal prior 
(TVBN) (Armagan et al., 2011), a fairly general global- 
local shrinkage prior (Poison & Scott, 2010), for which 
it has been demonstrated that it has better mixing proper¬ 
ties than priors such as spike-slab (Carvalho et al., 2010). 
Shrinkage for the elements of ^ is also employed, because 
it allows us to identify the features of A that contribute to 
the classification task. Intuitively, we can see A as a dic¬ 
tionary with K features, each feature explaining a subset of 
the input variables due to shrinkage; via separate shrinkage 
within the model, (3 selects from the K features to build 
a predictor for labels y. Being able to specify global and 
local properties independently makes the TVBAf prior at¬ 
tractive for high-dimensional settings, such as gene expres¬ 
sion and RNA sequencing, which are precisely the types of 
data we will focus on in our experiments. 
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Linear discriminative factor model By imposing the 
max-margin rank-likelihood construction in (5) to X and 
the hinge loss to y via pseudo-likelihoods in (7) and ( 8 ), 
respectively, one possible prior specification for the super¬ 
vised factor model parameterized by {A, Z, (3} is 

a,,fc ^ z„ ~ AA( 0 ,Ik), 

where are global shrinkage parameters for load¬ 

ings A and classifier coefficients (3. Furthermore, for the 
TV BN prior we can write 


0^i,k 

^Ni0T^,k), 




- Ga(ra,r7i,/c) , 

h 

~ Ga(r;3,efc), 

‘Hi.k ^ 


^k 

~Ga(s;3 ,$(«). 

Setting Ta 

= Sa = 5 (for (3, it 

is 

= sp = ^), a special 


case of TV BN corresponds to the widely known horse¬ 
shoe prior (Carvalho et al., 2010). Note that each column 
of the loadings has a different global shrinkage parame¬ 
ter thus allowing them to have different degrees of 
shrinkage. We can also infer (and by letting 

~ Ga(^, l>) and $ ~ Ga(^, 1). As a result of hav¬ 
ing individual shrinkage parameters for each column of A, 
we could say that the prior is capable of “turning off” un¬ 
necessary factors, hence having an automatic relevance de¬ 
termination flavor to it (MacKay, 1995; Wipf & Nagara- 
jan, 2008). This is indeed the behavior we see in practice; 
there are other ways to select the number of factors, e.g., 
by adding a multiplicative gamma prior to matrix A (Bhat- 
tacharya & Dunson, 2011). 

Non-Linear discriminative factor model When the la¬ 
tent space for factor scores, z„, is not linearly separable, 
nonlinear classification approaches might be more appro¬ 
priate. One may use a kernel to extend the linear SVM 
to its nonlinear counterpart. However, from a Bayesian 
factor modeling perspective, adding kernel-based nonlin¬ 
ear classifiers is nontrivial, because they tend to make in¬ 
ference complicated and computationally expensive due to 
loss of conjugacy for the parameters involved in the nonlin¬ 
ear component of the model, i.e., the kernel function. From 
a different perspective, it is still possible to build a global 
nonlinear decision rule as a mixture of local linear classi¬ 
fiers (Shahbaba & Neal, 2009; Fu et al., 2010). The basic 
idea is to assume factor scores as coming from a mixture 
model, in which each mixture component has an associated 
local linear Bayesian SVM. Here we use a Dirichlet pro¬ 
cess (DP) in its stick-breaking construction (Sethuraman, 
2001 ), represented as 


where 9 * = 1 ’ represents a point measure at 

9l and a is the concentration parameter. Applied to our 
model, factor scores and labels are drawn from a paramet¬ 
ric model t/n, ^ f{dn) with parameters 0 „, where ^ 
G. For G as in (9) and a finite number of samples N, many 
of the {un, z„} share the same parameters, therefore mak¬ 
ing {un, z„} a draw from a mixture model. Specifically, we 
make /( 0 „) = z„)A/'(z„|/r„, Go = 

TVBNmrp,s0,3>(P^) X N{ti\0,lK) X Ga(V'|V's, V'r) 

and {f3„, VtNt} sample n belongs to 

the f-th component of the mixture. In practice we truncate 
the sum in (9) to T terms to make inference easier (Ish- 
waran & James, 2001) and set tps = 1-1 and N = 0.001 
(i.e., a non-informative prior). 

Predictions Making predictions for new data using our 
model is conceptually simple. We use the pair {y, X} to 
estimate the parameters of the model (training), namely 
{A, Z, (3}, then given a test point x*, we go through three 
steps; (i) Compare x* to X to determine the rank of each 
component of x* w.r.t. to the training data, which amounts 
to finding w“*}, for * = 1,... ,d. (ii) For fixed 

{A,wl rc"*}, estimate z* from its conditional posterior. 
(in) Make a prediction for x* using sign(/3^z*). 

The first step of this prediction process is exclusive to the 
proposed rank-likelihood model, and implies that we are 
required to keep the training data in order to make predic¬ 
tions. This is in the same spirit of supervised kernel meth¬ 
ods, in the sense that predictions are a function of the data 
used to fit the model (Scholkopf & Smola, 2001). Note, 
however, that for a single component of a test point, Xi *, 
we only need two elements of the training set: the two ele¬ 
ments from Xi closest to from above and below, which 
is closely related to the fc-nearest neighbor paradigm (rather 
than k nearest neighbors, we only require the two training 
neighbors “to the left and right” of a test data component). 
Intuitively, what our model does at prediction is to find a 
latent representation z* such that x* is in between but as 
far as possible from its upper and lower bounds w.r.t. to X. 
This is a very unique characteristic of our model. 

Inference Due to fully local conjugacy, we can write the 
conditional posterior distribution for all parameters of our 
model in closed form, making Markov Chain Monte Carlo 
(MCMC) inference based on Gibbs sampling a straightfor¬ 
ward procedure. Space limitations prevent us from pre¬ 
senting the complete set of conditionals, however below 
we show expressions for the parameters involving the max- 
margin rank-likelihood in (7), namely A and Z. For con¬ 
venience, we denote 

-p _ 2/n/3/c[l “t“ Ayj VniP 


G — St” 1 , 

vt ^ Beta(l, a), 


(9) 


A?;.; 


(/3 


{K,n) ^ ^ > 

'^n i^k^k.n ; 


9t = - ^i) > 

0: ~ Go, 
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In the following conditional posterior-distributions, 
refers to the conditioning parameters of the distributions. 

Sampling A: 


p{ai,k\ •) = A/'(/ra._,,cra, J, 


^2 


N 


v-1 

;,n ’ 


2 \ {1^) 
Pa^.k = > 


,(fe) 


Sampling Z; 


P{Zk,n\ ■) •^{Pzk,nT^Zk,ri') ’ 

^-2 _ -1 I 2 \-l I il 

Z^i=l ^i,k^i,n A“ > 


where 


\{k) _ ( "'U 

“ I 



k^i,k^k^n^i,n ■ 


Conditional posteriors for the remaining parameters: 

{K,u^ K,n> K, and can 

be found in Poison & Scott (2011b) and Armagan et al. 
(2011), respectively. In our experiments we set e = 
0.05, however a conjugate prior (gamma distribution) ex¬ 
ists hence e can be inferred if desired. Inference details for 
the DP specification for the factor scores can be found for 
instance in Ishwaran & James (2001); Neal (2000). 


In applications where speed is important, we can use all 
conditional posteriors including those above to derive a 
variational Bayes (VB) inference algorithm for our model, 
which loosely amounts to replacing the conditioning on 
variables with their corresponding moments. Details of the 
conditionals are not shown here for brevity, and details of 
the VB procedure are found in the Supplementary Material. 


Other related work For ordinal data, Xu et al. (2013) 
presented a factor model using the ordered probit mech¬ 
anism, but in which the probit link is replaced with a 
max-margin pseudo-likelihood. Inference is very efficient, 
but they still have to infer the thresholds {hi}. How¬ 
ever, in their collaborative prediction applications, vari¬ 
ables only take one of six possible values. For count data. 
Chib et al. (1998) proposed a generalized-linear-model in¬ 
spired Bayesian model for Poisson regression, that can be 
easily extended to a factor model. However, expensive 
Metropolis-Hastings sampling algorithms need to be used, 
due to the non-conjugacy between the prior for A and the 
log link. More recently, Zhou et al. (2012) presented a 
novel formulation of Poisson factor analysis (PFA), based 
on the beta-negative binomial process, for which inference 
is efficient. Furthermore, none of the approaches discussed 


Table 1. Composition of different methods. 


Method 

Likelihood 

Classifier 

DPM 

G-L-Probit 

Gaussian 

probit 

No 

G-L-BSVM 

Gaussian 

BSVM 

No 

OR-L-Probit 

ordinary rank 

probit 

No 

OR-L-BSVM 

ordinary rank 

BSVM 

No 

R-L-BSVM 

max-margin rank 

BSVM 

No 

G-NL-BSVM 

Gaussian 

BSVM 

Yes 

R-NL-BSVM 

max-margin rank 

BSVM 

Yes 


above consider discriminative factors models, and for PFA 
this is very difficult, because in that case the prior for 
the factor scores is not a Gaussian distribution and is thus 
not conjugate to the SVM or probit-based likelihoods. As 
a result, building discriminative factor models under that 
framework is challenging, at least not without Metropolis- 
Hastings style inference. 

4. Experiments 

We present numerical results on two benchmark (USPS 
and MNIST) and two real (gene expression and RNA se¬ 
quencing) datasets, using part of or all methods summa¬ 
rized in Table 1; inference is performed via VB. The data 
likelihood can be either Gaussian, rank or the max-margin 
rank-likelihood. The labels (classifier) can be modeled us¬ 
ing the probit link or the Bayesian SVM (BSVM) pseudo¬ 
likelihood. When the DP mixture (DPM) model is used, 
the classifier results in a nonlinear (NL) decision function. 
Everywhere we set A = 20, T = 5 and performance mea¬ 
sures were averaged over 5 repetitions (standard deviations 
are also presented). We verified empirically that further in¬ 
creasing K or T does not significantly change the outcome 
of any of our models. All code used in the experiments was 
written in Matlab and executed on a 3.3GHz desktop with 
16Gb RAM. 

In the following experiments we focus on comparing dis¬ 
criminative factor models against each other to show how 
each component of the model contributes to the end per¬ 
formance produced by our full model. In particular, we 
show that the Bayesian SVM, max-margin rank-likelihood 
and nonlinear decision function all improve the overall per¬ 
formance on their own, when compared to standard ap¬ 
proaches such as probit regression and Gaussian models on 
log-transformed data. It is important to take into consider¬ 
ation that our model is at the same time trying to explain 
the data and to build a classifier via a linear latent represen¬ 
tation of ranks, thus we will not attempt to match results 
obtained by more sophisticated state-of-the-art classifica¬ 
tion models (e.g., a nonlinear SVM applied directly to raw 
data may yield a good classifier, but it does not afford the 
generative interpretability of a factor model, the latter par¬ 
ticularly relevant to biological applications). Our model is 
ultimately trying to find a good balance between covari- 
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Table 2. Mean error rates (%) and runtime in seconds for the test data of the USPS 3 vs. 5 subtask. 



G-L-Probit 

G-L-BSVM 

OR-L-Probit 

OR-L-BSVM 

R-L-BSVM 

G-NL-BSVM 

R-NL-BSVM 

Error 

Runtime 

5.95±0.005 

8.64 

5.86±0.008 

10.29 

5.05±0.013 

14.07 

4.92±0.027 

14.19 

4.53±0.026 

16.05 

3.88±0.017 

23.81 

3.23±0.025 

36.63 


Table 3. Mean error rates (%) and runtime in seconds for the test data of the MNIST 3 vs. 5 subtask. 



G-L-BSVM 

R-L-BSVM 

L-SVM 

G-NL-BSVM 

R-NL-BSVM 

NL-SVM 

Error 

5.05±0.053 

4.84±0.014 

4.68 

4.21±0.010 

2.10±0.007 

2.00 

Runtime 

150 

220 

140 

400 

600 

304 


ance structure modeling, interpretability through shrinkage 
and classification performance. All of this is done with the 
very important additional benefit of not requiring distribu¬ 
tional assumptions about the values of the data, as this in¬ 
formation is usually not known in practice (as in the sub¬ 
sequent biological experiments below). However, in those 
cases where the distribution is known a priori it should cer¬ 
tainly be reflected in likelihood function. 

4.1. Handwritten digits 

Digitized images are a good example of essentially non- 
Gaussian data traditionally modeled using Gaussian noise 
models in the context of factor models and dictionary learn¬ 
ing (Mairal et al., 2008). However, depending on pre¬ 
processing, they can be naturally treated either as contin¬ 
uous variables representing pixel intensities when filter¬ 
ing/smoothing is pre-applied, or as discrete variables rep¬ 
resenting pixel values when raw data is available. Our run¬ 
ning hypothesis here is that a rank-likelihood representa¬ 
tion for pixels is more expressive than its Gaussian coun¬ 
terpart. Intuitively, a discriminative factor model should be 
able to find features (subsets of representative pixels) that 
separate image subtypes. However, without the Gaussian 
assumption for observations, our rank model might be able 
to adapt to more general conditions, e.g., skewed or heavy¬ 
tailed distributions. In our results we use classification er¬ 
ror on a test set as a quantitative measure of performance. 


USPS First we consider the models in Table 1 to the well 
known 3 vs. 5 subtask of the USPS handwritten digits 
dataset. It consists of 1540 smoothed gray scale 16 x 16 
images rescaled to fit within the [—1,1] interval. Each ob¬ 
servation is a 256-dimensional vector of scaled pixel inten¬ 
sities. Here we use the resampled version, which is 767 
images for model fitting and the remaining 773 for testing. 
Results in Table 2 show that consistently: rank-likelihood 
based models outperform Gaussian models, BSVM out¬ 
performs the probit link, and nonlinear outperforms lin¬ 
ear classifiers. Furthermore, the proposed max-margin rank 
likelihood model performs best in both variants, linear and 
nonlinear. In every case inference took less than 1 minute. 


MNIST Next we consider the same 3 vs. 5 task, this time 
on a larger dataset, the MNIST database. The dataset is 
composed by 11552 training images and 1902 test images. 
Unlike USPS, MNIST consists of 28x 28 raw 8-bit encoded 
images, so each observation is a 784-dimensional vector of 
pixel values (discrete values from {0,..., 255}). Results 
for four out of six methods from Table 1 are summarized 
in Table 3. Results for probit based models were not as 
good as those for BSVM, thus not showed here, nor in the 
upcoming experiments. Instead, we include results for a 
linear (L-SVM) and nonlinear (NL-SVM) SVM with RBF 
kernel directly applied to the data as baselines, without the 
factor model. From Table 3 we see that the proposed model 
works better than the Gaussian model, and that the results 
for R-NF-BSVM are close to that for NL-SVM. This is not 
surprising, as a pure classification model (e.g., NL-SVM) 
does not attempt to explain the data but only to maximize 
classification performance. In this case, the most expen¬ 
sive approach, namely R-NL-BSVM has a runtime in the 
neighborhood of 10 minutes which is deemed acceptable 
considering the size of the dataset. Visualizations of the 
factor loadings. A, learned by various models are presented 
in the Supplementary Material. 

4.2. Gene expression 

We applied our model to a newly published tuberculosis 
study from Anderson et. al. (2014), consisting of gene 
expression intensities for 47323 genes and 334 subjects 
(GEO accession series GSE39941). These subjects can be 
partitioned in three phenotypes: active tuberculosis (TB) 
(111), latent TB (54) and other infectious diseases (169), 
and in whether they are positive (107) or negative (227) 
for HIV. The raw data were preprocessed with background 
correction, sample-wise scaling and gene filtering. For the 
analysis we keep the top 4732 genes with largest inten¬ 
sity profiles. Results for three binary classification tasks 
using a one vs. the rest scheme, the HIV classifier and 
10-fold cross-validation are summarized in Table 4 (error 
bars for accuracies omitted due to space limitations). We 
also present area under the ROC curve (AUC) to account 
for subset imbalance. As an additional baseline, we in¬ 
cluded a Poisson Factor Analysis (PFA) model (Zhou et al.. 
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Table 4. AUC (with error bars), accuracy and runtime in seconds for gene expression data. 


Methods 

PFA-L-BSVM 

G-L-BSVM 

R-L-BSVM 

G-NL-BSVM 

R-NL-BSVM 

TB vs. Others 
Active TB vs. Others 
Latent TB vs. Others 
HIV(+) vs. HIV(-) 
One fold time 

0.740±0.102, 0.683 
0.802±0.070, 0.775 
0.849±0.051,0.802 
0.850±0.056, 0.793 
130 

0.766±0.093, 0.704 
0.857±0.050, 0.784 
0.907±0.037, 0.841 
0.879±0.055, 0.844 
141 

0.814±0.052, 0.740 
0.896±0.028, 0.832 
0.9233=0.041, 0.868 
0.9003=0.055, 0.856 
180 

0.8473=0.061,0.778 
0.921±0.034, 0.853 
0.9343=0.029, 0.874 
0.915±0.041, 0.850 
330 

0.8723=0.025, 0.781 
0.948=t0.021, 0.880 
0.9543=0.025, 0.889 
0.9593=0.051, 0.901 

450 


2012) with Bayesian SVMs, as a 2-step procedure. For 
the Gaussian models we log-transform intensities, and for 
PFA we round them to the closest integer value (raw inten¬ 
sities become floating point values after background cor¬ 
rection and scaling). We can see that our models outper¬ 
form the others in each of the classification subtasks, and 
R-NL-BSVM performs the best overall with a reasonable 
computational cost. It is important to mention that we 
are not building separate discriminative factor models for 
each subtask, instead a single factor model jointly learns 
the four predictors, meaning that all classifiers share the 
same loadings and factor scores. As a result, our model 
operates here as a multi-tasking learning scheme. Figure 3 



gene expression data. 


wise rescaling to compensate for differences in library size, 
log-transform for Gaussian models and rounding for PFA. 
Subjects are split into three different groups, namely sys¬ 
temic inflammatory response (SIRS) (26), sepsis survivors 
(SeS) (78) and sepsis complications leading to death (SeD) 
(29). Three binary comparisons are the main interest of the 
study: SIRS vs (all) sepsis, SeS vs. SeD and SIRS vs. SeS. 
Being able to classify this sub-groupings is important for 
two reasons: (i) these tasks are known to be hard classifi¬ 
cation problems, and (ii) a recently published study by Liu 
et al. (2014) showed that approximately 40% of hospital 
mortality is sepsis related. Classification results for 10-fold 
cross-validation including AUC, accuracy and runtime per 
fold are summarized in Table 5. Once again our model per¬ 
forms the best. We also tried the nonlinear version of our 
model but figures were omitted due to very minor improve¬ 
ments in performance. 


Table 5. AUC, accuracy and runtime(s) for RNAseq data. 


Methods 
SIRS vs. Se 
SeD vs. SeS 
SIRS vs. SeS 
One fold time 


PFA-L-BSVM 


G-L-BSVM 


R-L-BSVM 


0.70±0.04, 0.73 
0.76±0.05, 0.70 
0.75±0.02, 0.71 
179 


0.78±0.02, 0.76 
0.76±0.01, 0.75 
0.87±0.01, 0.71 
175 


0 . 86 ± 0 . 01 , 0.81 
0 . 82 ± 0 . 02 , 0.78 
0 . 91 ± 0 . 01 , 0.87 

226 


5. Conclusion 


shows the coefficients of the classifier learned using R-L- 
BSVM. The leading coefficients reveal that certain factors 
are key to different classes. For instance, factor 14 is spe¬ 
cific to TB vs. others, factors 1 and 5 are specific to TB 
vs. latent TB, and factors 9 and 4 are specific to TB vs. 
others including HIV(-i-). We performed a pathway as¬ 
sociation analysis using DAVID (Huang et al., 2009) on 
the top 200 genes from each factor. We found interesting 
associations. Factor 14: ubiquitin-protein, ligase activity 
and immunodeficiency. Factor 9: immune response, lym- 
phocite/leukocite/T cell activation and apoptosis. Factor 5: 
proteasome complex, response to stress, response to antibi¬ 
otic. Factor 4: ribonucleoprotein, proteasome, ubiquitin- 
protein, ligase activity. The complete gene lists and the 
inferred gene networks are provided in the Supplementary 
Material. 

4.3. RNA sequencing 

Finally, we consider a new RNA sequencing (RNAseq) 
sepsis study (?). The dataset consists of 133 subjects 
and 15158 genes (after removing genes with more than 
25% zero entries). Data preprocessing consists of sample- 


We have developed a Bayesian discriminative factor model 
for data that are generally non-Gaussian. This is achieved 
via the integration of a new max-margin rank likelihood, 
Bayesian support vector machines, global-local shrinkage 
priors, and a Dirichlet process mixture model. The pro¬ 
posed model is built on the ranks of the data, opening the 
door to treat ordinal, continuous and discrete data (e.g., 
count data) within the same framework. Experiments have 
demonstrated that the proposed factor model achieves bet¬ 
ter performance than widely used log-transformed-plus- 
Gaussian models and a Poisson model, on both gene ex¬ 
pression and RNA sequencing data. These results highlight 
the potential of the proposed model in a variety of applica¬ 
tions, especially computational biology. 

Our rank based models are relatively more computation¬ 
ally expensive than Gaussian models on log-transformed 
data. However, in applications such as gene expression 
or sequencing that constitute the real data used in our ex¬ 
periments, runtimes are still significantly lower when com¬ 
pared to the time needed to generate the data. For biologi¬ 
cal studies, the quality and interpretability of the results are 
of paramount importance, with speed a secondary issue. 
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6. Model 


In the following conditional posterior-distributions, 
refers to the conditioning parameters of the distribu¬ 
tions, IG(a, 5) denotes the inverse Gaussian distribution, 
Ga(a, b) the gamma distribution, and GIG(a, b,p) the gen¬ 
eralized inverse Gaussian distribution. 

In the linear case, when the Dirichlet process mixture 
(DPM) model is not used, the conditional posterior distri¬ 
butions are; 


The full Bayesian model is: 

Xi,n= g%{wi^n), t = l,...,(i; n = l,...,iV; 

Li{Wi^ri\'^i\n) = J ~ ~ ^i,ni ^i,n) 

X -e - >'\,n)dKnd>'\,n , 

f°° 1 

Ln{yn\f^,'^n) = / , 

Jq y 


1. A; 
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Ft 
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^k 
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~ Ga(l/2,1). 


7. MCMC inference 

For convenience, we denote 
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P((A“„)-'|-) =IG(K„ + e-<J-\l) , 
f((A)))-'|-) =IG(|l-y„/3^z„|-i,l) . 




Max-Margin Rank-Likelihood 


4. /3: 


1. A: 


N ^2 
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M/5fc 




2/n -2/c,r^ 


n—1 


1 H“ Ayj Vnif^ ^n)\fc 


AS 


p(&fc|-) = GIG(2e/c,/3fc,r/3 - 0.5), 

p(efe|-) = Ga(r/3 + S/3,6fc + , 

p( 4>(/5)|.) = Ga , 

p(|.(/3)|.) = Ga(l,4>(^) + ll 


In the nonlinear case, when the DPM is used; 

1 . t{n) (mixture component index for n-th observation): 

p{t{n) =t\-) (X qtM{zn;tit,ip^^lK) ■ 

2. DPM parameters; 

p(/it.fe|-) =1+ H V-t: 

n:t{n)—t 






ZkA 


‘'K,n , 

n:t(n)—t 


p(V'tl-) = Ga +0.5A, r/'r + 0.5 ^/i? 
p(i/t|-) = Beta I 1 + ^ 1, a+ ^ 1 

\ n:t{n)=t n;t(n)>i y 

/ T-1 N 

p(q;|-) = Ga a* + T - 1, ^ log(l - i^t) 




In this case, /3 in 2) and 4) should be replaced by for 
t — 1,... ,T, and the summation over n in 4) will only 
account for {n : t{n) = t}. 

8. VB inference 

Since the model is fully local conjugate, the VB update 
equations can be obtained using the moments of the above 
conditional posterior distributions. Here we present the 
moments for the model without DPM, and for the VB in¬ 
ference of the DP mixture model, please refer to (Blei & 
Jordan, 2005). In the following expressions, (•) denotes 
expectation, JCp( ) is the modihed Bessel function of the 
second kind, = (a7)(z„), = {aij){zl) and 

«n) = (a7)(Zn)- 


N 


O-r.k) = ^{Zk,n){^^!^l), 

71 — 1 

/ N ^ 
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9. Inferred Factor Loadings on the 
Handwritten Digits 

We plotted the factor loadings A learned from USPS and 
MNIST datasets in Figures 5 and 4, respectively. Four 
models, G-L-BSVM, R-L-BSVM, G-NL-BSVM and R- 
NL-BSVM are used as examples. It can be be seen that 
the Gaussian model is trying to learn the dictionaries to re¬ 
construct images while the rank model is trying to learning 
the differences (focusing on the edges). 


10. Results on Gene Expression Data 

We show the results of our model for gene expression data. 
K = 20 factors are used and here we only show the results 
generated by the proposed max-margin rank model without 
DP, i.e., using linear Bayesian SVM as the classifier. Fig¬ 
ure 6 shows the coefficients j3 of the learned classifiers and 
Figure 7 the inferred gene network from the learned factor 
loading matrix A. 
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I Active TB vs. Others Latent TB vs. Olliers I ITB vs. Otlieis HI\'(+) vs HIV(-) 



Figure 6. The learned classifier coefficients /3 of the 4 classifiers for the gene expression data. 
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Figure 7. The learned gene network inferred from the factor loading matrix A. 










































